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Abstract: 

We study the problem of two-sample comparison with categorical data when the 
contingency table is sparsely populated. In modern applications, the number of cat- 
egories is often comparable to the sample size, causing existing methods to have low 
power. When the number of categories is large, there is often underlying structure 
on the sample space that can be exploited. We propose a general non-parametric 
approach that utilizes similarity information on the space of all categories in two 
sample tests. Our approach extends the graph-based tests of [Friedman and~R afsky 



1979 and Rosenbaum 2005 , which are tests base on graphs connecting obser- 
vations by similarity. Both tests require uniqueness of the underlying graph and 
cannot be directly applied on categorical data. We explored different ways to ex- 
tend graph-based tests to the categorical setting and found two types of statistics 
that are both powerful and fast to compute. We showed that their permutation 
null distributions are asymptotically normal and that their p- value approximations 
under typical settings are quite accurate, facilitating the application of the new 
approach to real problems. The application of this new approach is illustrated 
through several examples. 

Key words and phrases: Two-sample tests, categorical data, discrete data, minimum 
spanning trees, graph-based tests, contingency table. 



1 Introduction 

Testing whether two data samples are drawn from the same distribution is a 
fundamental problem in statistics. For low-dimensional Euclidean data, there are 
many approaches, both parametric and non-parametric, to this problem. When 
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the data are categorical, the existing approaches are much more limited. The 
standard procedure is to assume that each sample is drawn from a multinomial 
distribution, and the comparison becomes a test of whether the two samples come 
from the same multinomial distribution. Classical methods, such as the Pearson's 
Chi-square test and the deviance test, work well when we observe each category a 
large number of times. At least, the region in the contingency table where the two 
groups truly differ needs to be adequately sampled for existing tests to achieve 
good power. However, in many modern applications, the number of possible 
categories is comparable to or even larger than the sample size. Following are 
some examples: 

Preference rankings: Survey data in marketing or psychometric research often 
come in the form of preference rankings. Subjects may be asked to rate 
wine (rank from best to worst tasting), pictures (choose 3 most familiar 
out of 5), or insurance plans (identify the most and least desirable). See 
Diaconis] |1988 and Critchlow |1985| for more detailed examples on ranked 



and partially ranked data. It is a common problem to compare two groups 
of subjects to see if there is any between-group difference in preference. The 
number of possible full rankings is the factorial of the number of objects 
being rated, and the number of possible rankings is higher if some subjects 
only partially rank the objects. 

Haplotype association: In genetics, a haplotype is a combination of alleles 
at adjacent loci on a chromosome that is transmitted together. A com- 
mon problem of genetic association studies is to compare haplotype counts 



between treatment and control groups (e.g. see Zaykin et al. 2002 and 



Furihata et al. 2006| ). Each haplotype can be represented as a fixed- length 



binary vector. The number of possible haplotypes is exponential in the 
number of loci. Haplotypes that are longer than 10 are often of interest in 
genetics, leading to > 1,000 possible combinations. However, the number 
of subjects in association studies is often only in the thousands or even 
hundreds, and the counts for most haplotypes are small. 



Sequence or document comparisons: In the modern age of digitized texts, 
it is often of interest to compare the word composition in two different doc- 
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uments. A similar problem is the comparison of DNA or protein sequences, 



which plays a large role in bioinformatics Lippert et al. , 2002 . The number 
of possible words in these applications can be very large, while the counts 
for most words are small. For recent interest in this problem see |Perry and 



Beiko |2010 , Bush and Lahn |2006 and Rajan et al. 2007| for examples. 



Classical Chi-square tests have low power in the above scenarios due to 
sparsity of the contingency table and high dimensionality of the parameter space. 
For exact tests, it is possible to generalize the concept to the setting of more than 
two categories, but this is computationally challenging [Mehta and Patel" 1983 



and not efficient due to the existence in high dimensions of many equivalent 
tables, which are tables that have the same probability as the one observed. 

When the number of categories is very large, there is often underlying simi- 
larity between different categories that can be exploited. For example, rankings 
can be related through Kendall's or Spearman's distance. Hamming distance 
or other more sophisticated measures can be used to compare haplotypes and 
fixed-length words in DNA sequences. In document comparison, the similari- 
ties between words are not equally likely: Some words are synonyms of others; 
Some are more likely to be used together. Such similarity information between 
categories can be properly used to improve the power of two-sample tests. 

We assume that a distance matrix has been given on the set of categories, 
and adopt a graph-based approach proposed by Friedman and Rafsky [1979 and 



Rosenbaum 2005) , where a graph is constructed on all subjects so that subjects 



more similar in value are connected by an edge. Friedman and Rafsky's test is 
based on a minimum spanning tree (MST), and Rosenbaum's test is based on 
minimum distance pairing (MDP). The test statistic in both cases is the number 
of edges connecting subjects from different groups. The underlying rationale 
is that, if two groups come from the same distribution, subjects coming from 
the same group should be as distant to each other as subjects coming from 
different groups. More details of these tests are given in Section [3} Both tests, 
however, require uniqueness of the underlying graphs. When the distance matrix 
on subjects is filled with ties, which is characteristic of categorical data, neither 
approach can be directly applied. 
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Ties in the distance matrix lead to ambiguity in constructing the MST or 
MDP, and the number of possible graphs increases rapidly with the number of 
ties. Some efforts were made to address this problem. In the analysis of a partially 
ranked data set with 38 subjects in 23 categories, |Critchlow| [1985] tried both the 
graph obtained from the union of all MSTs (uMST), and the graph obtained 



from the union of all nearest neighbor graphs (uNNG). Nettleton and Banerjee 



20011 also used uNNG on a binary clinical feature data set with 64 subjects in 63 



categories. In general, nearest neighbor graphs do not work well for categorical 
data, see Section [4j In this paper, Critchlow's method using the uMST is studied 
in more detail and a computationally tractable form for categorical data is given. 
A different statistic, based on averaging over all optimal graphs of a certain kind, 
is also proposed and analyzed. 

In Section [4j analytically tractable forms of the two statistics based on av- 
eraging over and union of minimum spanning trees are derived and compared 
via simulation to statistics based on MDP and uNNG. While the two MST- 
based tests are shown to be more powerful than the MDP- and uNNG-based 
tests, neither the averaged nor the union-based statistic dominate in power for 
the simulation scenarios explored. Algorithmic details for computing these two 
statistics are described, and in particular the averaged statistic is shown to be 
computationally intractable for some problems. A generalized version of the av- 
eraged statistic, with better computational properties, is proposed. In Section [5j 
the graph-based approach is illustrated on real and simulated data examples, and 
shown to have much better power than Chi-square tests. In Section [6j permuta- 
tion null distributions of the proposed statistics are described. After mean- and 
variance- standardization, the statistics are shown to be asymptotically normal, 
under certain assumptions on the cell counts and the graph's structure, as the 
number of observed categories goes to infinity. 



2 Notations 

We start by introducing our notations. The different categories are indexed by 
1,2, ... ,K. The naming of the categories is arbitrary, that is, category 1 is not 
necessarily closer in distance to category 2 than to category 3. The two groups 
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are labeled a and b. The data is given in the form of a two-way contingency 
table (Table [l]). Without loss of generality, we assume that each category has at 
least one subject over the two groups. That is, categories with no observation in 
either group can be omitted from the analysis without loss of information. 

Table 1: Basic Notations. 
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Sometimes, we refer to individual subjects themselves, which we denote by 
Yi, . . . , Yjsr. Thus, each Yj takes value in {1, . . . , K} and has a group label 

{a, if Yi belongs to group a; 
(1) 
b, if Yi belongs to group b. 

We assume that a distance matrix, {d(i,j) : i, j = 1,.. . ,K} has been given on 
the set of possible categories, with d(i,j) small if categories i and j are similar. 
Possible ways of defining the distance matrix are shown for various examples in 
Section [TJ 

A graph G is defined by its vertices and edges. We use G to refer to both 
the graph as well as its set of edges, when the vertex set is implicitly obvious. | • | 
is used to denote the size of the set, so \G\ is the number of edges in G. For any 
node i in the graph G, we use Ef to denote the set of edges in G that contain 
node i, to denote the set of nodes in G that are connected to node i by an 
edge, and Ef 2 to denote the set of edges in G that contain at least one node in 
Vf ■ For any event A, I a is the indicator function that takes value 1 if A is true 
and otherwise. 

Following is a list of abbreviations for different types of graphs and test 
statistics: 
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MST: Minimum Spanning Tree, 
MDP: Minimum Distance Pairing, 
NNG: Nearest Neighbor Graph, 

uMST: The graph obtained by taking the union of all MSTs, 

uNNG: The graph obtained by taking the union of all NNGs, equivalent to the 
graph connecting each point to all of its nearest neighbors, 

Rg- The test statistic on the graph G, 

Rmst- the test statistic averaging over all test statistics computed on each of 
the MSTs, 

i?aMDp: the test statistic averaging over all test statistics computed on each of 
the MDPs, 



3 A Review of Graph-Based Two-Sample Tests 

By graph-based two-sample tests, we refer to tests that are based on graphs with 
the subjects {Yj} as nodes. We here suppose {Y} take distinct values such 
that the graphs mentioned below can be constructed uniquely. The graph can 



be constructed using the distance matrix on {Y}- Friedman and Rafsky 1979 
proposed the first graph-based two-sample test as a generalization of the Wald- 
Wolfowitz runs test to multivariate settings. Their test is based on a MST on 
the subjects, which is a spanning tree connecting all subjects that minimizes the 
sum of distances across edges. The Friedman- Rafsky test is based on the number 
of edges connecting subjects across different groups: 

I 9i¥=9ji ( 2 ) 

(vj')eG 

where G is the MST. The statistic above is standardized to have mean zero 
and variance one, and its value is compared to the null distribution obtained by 
permuting the group labels. Friedman and Rafsky showed that, while this test 
has low power in low dimensions, it has comparable power to likelihood ratio 
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tests in a numerical study of normal data in > 20 dimensions, and higher power 
when the normal assumption is violated. 

Another graph-based two-sample method, the cross-match test, was pro- 



posed by Rosenbaum |2005 . This test is based on minimum distance non- 
bipartite pairing (MDP), which divides the N subjects into N/2 (assuming N is 
even) non-overlapping pairs in such a way as to minimize the total of N/2 dis- 
tances between pairs. For odd N Rosenbaum suggested creating a pseudo data 
point that has distance with all other subjects, and later discarding the pair 
containing this pseudo point. The sum ^ is computed with G set to the MDP. 
The test statistic is the mean- and variance- standardized version of this sum. 
Note that the topology of the MDP does not depend on the distance matrix, 
with each node always having degree 1. This fact makes the test based on MDP 
truly distribution-free under the null hypothesis. 

Both methods assume uniqueness of the type of graph used. For categorical 
data, ties appear in the distance matrix whenever a category has multiple counts. 
Even sparse contingency tables have quite a few cells containing more than one 
subject. The number of possible graphs grows rapidly with the number of ties. 
Thus, Friedman and Rafsky's and Rosenbaum's methods can not be directly 
applied to categorical data. For categorical data, distances are often based on 
qualitative measures, and thus while their relative ranking may be trustworthy, 
their absolute scale is not. Hence, we do not consider methods based directly 
on the distance matrix. While there are many ways to construct a graph based 
on a distance matrix, we limit our study to MST, MDP and uNNG, which are 
representative. Figure [T] illustrates the three different types of graphs on a simple 
example containing six points. These six points take on six distinct values. 



4 Generalized Graph-Based Test Statistics 

One natural solution, when the optimizing graph is not unique, is to average 
the test statistic over all graphs of the given kind. In this section, we consider 
the statistic based on averaging the sum ^ over all MSTs (R&hst)- Another 
solution to non-uniqueness it to take the union over all optimizing graphs, such 
as the statistic based on the uMST (-R u mst)- Rmst and -R u mst are analytically 
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Figure 1: Illustration of MST, MDP, and uNNG on six points. Notice that there are two 
possible MSTs on the six points and only one is shown. 



tractable and intuitively appealing, and their derivations are shown in Section 
4.1 For comparison, we also consider the statistic based on averaging ([2]) over 
all MDPs, i?aMDP> an d the statistic based on uNNG, -R u nng- Computation of 
-RaMDP, described in Appendix [A] is often intractable. Computation of uNNG is 



instantaneous. In Section 4.2 we study by simulation the performance of four 
graph-based statistics, i?aMST, -Rumst> -RaMDP) -Rumngj comparing them to each other 
and to Chi-square tests. Our results show that tests based on minimum spanning 
trees have best power, and the intuition for this is explained. Computation of 
i?aMST and R u mst is described in more detail in Section |4.3[ When the number of 
MSTs on categories is large, which is common for categorical data, computation 
for .RaMST can be very costly. We generalize the statistic based on .R a MST to a 



similar but simpler form in Section 4.4 



4.1 The Test Statistics Based on MST 

4.1.1 -RaMST 

First, we define more notations. For each k = 1, . . . , K, let C {1, . . . , N} be 
the subjects that belong to category k. From Table [TJ \Ck\ = mf.. Let 7/j be the 
set of all spanning trees for Ck- Since the distance between any two subjects in 
Ck is zero, any spanning tree of Ck is a MST of Ck- Let Tq be the set of all MSTs 
on the categories. We can embed each tree in 7q* as a graph on the subjects by 
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Figure 2: Embedding the MST on categories on the subjects. This figure only shows 3 
out of 15552 possible embeddings. 



randomly picking one subject in Ck to represent category k, for k = 1, . . . , K. 
For each Tq €Tq, there are 



K 

n 

fc=i 



in 



K°\ 



(3) 



different embeddings. For example, Figure [2] shows 3 out of 15552 (= 2 • 3 3 • 1 • 
4 2 • 3 2 • 2) possible embeddings for a MST on six categories containing 2, 3, 1, 4, 
3 and 2 subjects. Let 7o be the set of all graphs obtained from embedding a tree 
from Tq on the subjects. Then 

/ K T* \ 



1751 



e n 



(4) 



Let 7" be the set of all MSTs on the N subjects. Then, any member of T 
can be represented as a union of a graph from 7o and a graph from each of 
{7fc : k = 1, . . . , K}, and vice versa. Thus, 

T = < r U ( (J Tit) : to € 76, Tfc e Tfe, A; = 1, . . . , K > , 



fc=i 



with 



where £„ 



A' 



|T| = |Toin^> 



(5) 



fc=l 



m 



m-2 



is the number of spanning trees on m points by Cayley's 
formula. Then, the test statistic based on averaging all MSTs on subjects can 
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be defined as: 

-RaMST = |T| 1 ^ R T , (6) 

where i? T is §2^ with G = t. The following theorem gives a computationally 
tractable form for i? a MST i n terms of the cell counts of the contingency table and 
the set of possible MSTs on categories. 

Theorem 1. The test statistic based on averaging over all MSTs on subjects is 
K K * 
Kaffir =2^ 2^ H m k 2^ — • (7) 

fe=l T o eT fe=l £Tq 



Proof. 



-RaMST — | T| 1 -Rt 

= I^ _1 E E"-- E [Rr„+ J Rr 1 + --- + J Rr A .] 

to 67b n £71 t k gTk 

= 17-01- E^o + E 



Toe7o A;=l 



E R-Tk/Sm-k 

T k eT k 



(8) 



First consider the quantity ^ Tfc eT fc ^fc/'^W ■ Since all pairs of subjects in a given 
category have the same distance (= 0), the edge between them should appear in 
the same number of trees. There are in total m^m^ — l)/2 possible pairs and 
each spanning tree for Ck has nik — 1 edges. Hence, the edge between each pair 
of subjects in Ck appears in exactly 

mk(m k - l)/2 m k 

trees. Thus, 



s~ - iL W« s - ~^ k ~- w 

r k eT k mk i.; r . c, .,■ ., mk 



Next consider the summation over 7o- For any i £ C u , j 6 C v , if (u, v) £ Tq, then 
the edge appears in 

K T 5 



k=l 
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elements in To, since any of the m u m v possible edges connecting categories u and 
v appear in equal number of graphs in To- Thus, 



me7o 

_ Np Y\ K V n au n bv +n av n bu / ln \ 

— 2^ r *eT * 11 fc=l 2^(u,u)e<r * m u m„ " V iu J 

Combining (|9|> and ([lOj| gives @. □ 

The following corollaries show that i? a MST has a much simpler form if there is 
a unique MST on categories, or if the total number of subjects in each category- 
is the same. 

Corollary 1. When \T *\ = I, then 

K 



j-, \~~* ^ lt ak' l 'bk \~~* "au'w T /-i -i \ 

XaMST = > ^ V > , (11J 

^ m k m u m v 



^^ak^bk . V - ^ Tlaufl'bv ~\~ ^av^bu 

m k J ^ 

k=l (u,v)£tq 

where Tq is the unique MST on categories. 
Corollary 2. When mk = m, k = 1, . . . , K, 

jw=2^ - +|T | 2^ Z> ^2 • ( 12 ) 

fe=l r (t e7 0* 0.f)GT* 



The form (11) of the statistic is especially intuitive. For each category k, 
we call the term 2n fcri{,fc/mfc the mixing potential of the category. The mixing 
potential is maximized if ra a /% = = mjt/2, that is, when the subjects in 
category k are evenly divided between groups a and b; it is minimized when the 
category contains subjects from only one group. A mixing potential for each edge 
(u,v) can also be defined as (n^n^ + n av nbu) / '{m u m v ) . The edge-wise mixing 
potential is maximized when the edge connects a category containing only group 
a subjects with a category containing only group b subjects; it is minimized when 
both categories contain subjects only from one group. Thus, mixing potentials 
over categories and over edges between categories measure the similarity between 
the two groups. Corollary [T] shows that, when the MST on categories is unique, 
the test statistic -RaMST reduces to the sum of mixing potentials over nodes and 
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edges of the MST on categories. The similarity information on the categories is 
explicitly incorporated into the test through the sum of mixing potentials over 
the edges between categories. 

In testing, the sums ([7]), ( |11[ ) and (12) must be compared to their permuta- 



tion distributions. A generalized statistic that we propose later in Section 4.4 



is 



based directly on (11). 



4.1.2 -RuMST 

Following the notation from the previous section, let M.q denote the set of edges 
appearing in at least one MST on categories. That is, 

M* = {(n,v)er^.^ey}. 

In other words, A4q is the uMST with the categories as nodes. When there 
is only one MST on categories, Tq, then A4q = Tq; when there are multiple 
MSTs on categories, which is common for categorical data, obtaining A4q is not 
straightforward. Computation of A4q is discussed in Section 4.3 The following 
theorem describes the analytic form of -Rumst given M.q. 

Theorem 2. The test statistic based on uMST is 

K 

RuMST = ^ n akn-bk + ^2 ( 
k=l {u,v)&Ml 

Proof. Within each category, every pair of subjects is connected, which gives the 



first term of (13). If categories u and v are connected in any Tq G Tq , then each 



point in category u is connected to every point in category v, giving the second 



term of (13). If categories u and v are not connected in any t * £ T *, no edge 



will appear between categories u and v in uMST. □ 

Remark 1. Both R u mst an d Romst are derived from sums of Ig^g^ over edges 
of the uMST on subjects. The main difference between them is that R u mst treats 
all of the edges equally, while Rcmst assigns each edge a weight proportional to 



the number of MSTs on subjects in which the edge appears. Comparing ( 13 ) to 



(11), the denominators in (11) are omitted in (11). Each edge within category 



k appears in |T|/(mfc/2) MSTs, while each edge between categories appears in 
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\T\/(m u m v ) MSTs. Therefore, R u mst puts more weight on between- category edges 
than within- category edges. 

4.2 A Numerical Study 

In this section, the power of the four tests based on -RaMST, -Rumst, -RaMDP and 
-Runng are studied and compared to Pearson's Chi-square and deviance tests on 
simulated data sets. In each simulation, 30 points are randomly sampled from two 
different distributions - N(Q, 1) vs N(l,l), N(0, 1) vs JV(0,4), N(0,1) vb JV(1,4), 
and U(0, 5) vs U(l, 6). The combined sample of 60 points is then discretized into 
12 bins of equal width. The value 12 is chosen so that the average number of data 
points in each category is 5, mimicking the low cell count scenario. The bins are 
ranked by their start positions, and the distance between two categories is defined 
as the difference in their ranks. The p- values for all tests are calculated through 
1,000 permutation samples for each simulation run, and the power is obtained 
from 1,000 simulation runs. In Figure [3j power is plotted versus type I error for 
each test and each simulation setting. Pearson's Chi-square and deviance tests 
give very similar results, so only the results for the deviance test are shown. The 
deviance test is denoted by "LR" since it is based on the log-likelihood ratio. 
Power for all tests at the two most commonly used significance levels - 0.01 and 
0.05 - are listed in Table H 

First, compare i? a MST> Ram>p, and -R u nng- Rmsr is always significantly more 
powerful than i? a MDP 5 which in turn is always more powerful than -R u nng • This re- 
sult is intuitive from the definition of the different graphs. Since the MST must 
span the entire data set, K — 1 out of its iV — 1 edges are forced to connect points 
between categories. For MDP, if a category has even number of subjects, the 
subjects in that category would be paired amongst themselves; between-category 
edges is only possible for those categories having an odd number of subjects. For 
uNNG, as long as a category has more than one subject, the subjects in that cat- 
egory would not be connected to subjects from other categories. Therefore, tests 
based on MST make the most use of the similarity information among categories, 
while the test based on -R u nng makes the least use of this information. The sim- 
ulation results show a positive correlation between using similarity information 
and the power of the test. 
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Now, we compare the test based on -Rumst to i? a MST- As discussed in Remark 
[TJ the two statistics use the same set of edges but with different weighting. In 
simulation, the two statistics perform similarly under the three scenarios that 
compare two Normal distributions, while -R u mst has very little power, even much 
lower than i? a MDP and the deviance test, for the comparison of two Uniform distri- 
butions with different supports. When comparing two Normal distributions, the 
similarity between two categories is closely related to the difference of the ranks 
of the categories. That is, the further apart the ranks of the two categories, the 
less similar. However, when comparing two Uniform distributions with different 
supports - [0,5] vs [1,6] - only the ranks at the two ends are informative while the 
middle ranks are not. Since -R u mst puts more weight on between-category edges 
compared to -RaMST, it's power would be lower if the similarity measure among 
categories is not informative. 

Note that of all the graph-based tests, only the test based on .RaMST consis- 
tently outperforms the deviance test. 



N(0,1) 


vs N(l,l) 
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uMST 
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uNNG 
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Pearson 


a = 


= 0.01 


0.523 


0.495 


0.428 


0.234 


0.355 


0.346 
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= 0.05 


0.762 


0.740 


0.679 


0.492 


0.605 


0.605 


N(0,1) 


vs N(0,4) 














a - 


= 0.01 


0.304 


0.321 


0.233 


0.133 


0.165 


0.164 


a - 


= 0.05 


0.558 


0.585 


0.482 


0.382 


0.394 


0.396 


N(0,1) 


vs N(l,4) 














a - 


= 0.01 


0.560 


0.600 


0.434 


0.291 


0.352 


0.345 


a - 


= 0.05 


0.804 


0.824 


0.722 


0.569 


0.632 


0.626 


U(0,5) 


vs U(l,6) 














a = 


= 0.01 


0.354 


0.218 


0.310 


0.155 


0.283 


0.251 


a = 


= 0.05 


0.665 


0.486 


0.607 


0.383 


0.600 


0.552 



Table 2: The power of six tests - four graph-based tests based on i? a MST> -Rumst> -RaMDP; 
-Runng, the deviance test (LR) and Pearson's Chi-square test - under two significance 
levels (a = 0.01, 0.05) and different simulation settings. 

This simulation study is limited and only uses ranked data. We chose this 
study design for its interpretability. Though simple, the results are informative 
and show the advantage of averaged MST over averaged MDP and uNNG for 
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Figure 3: Power versus type I error for tests based on i? a MST, R\msj, Rmdp, the likelihood 
ratio (deviance), and RxmG under different simulation settings. 
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categorical data. Also, averaged MST is better than uMST when the similarity 
measure used to construct the graph is not effective. On the other hand, if the 
similarity measure is effective, the test based on uMST is comparable to, and 
sometimes better than, the test based on averaged MST. Hence, the rest of this 
paper focuses on the two tests based on Rmst and -R u mst- 



4.3 Computational Issues of -R a M ST and i? uM sT 

The analytic forms for i? a MST and -R u mst, (0 and (13), require enumeration of all 
MSTs on categories for i? a MST; and enumeration of all edges in Mq for -R u mst- Let 
M = \Tq\ be the number of MSTs on categories. If the distance matrix between 
categories is continuous- valued, then usually M = 1. Even when the distance 
matrix is arithmetic, M is small enough to be manageable for many problems. 
However, for problems that exhibit certain symmetries, enumeration of the set 
of all MSTs on categories is not computationally feasible. For example, Table 



4.3 lists the values of M for the haplotype association problem in Section 5.2 



assuming that there are no empty categories. In this problem, M is computed 
using the Matrix- Tree Theorem, yielding the formula 

* = s*-*-'n«p{(T)}- 

i=2 

For example, when the length of the haplotype is 6, which is a reasonably short 
length in genetic studies, there are only 64 possible categories, but M is already 
larger than 10 45 . One may argue that in this case, (J7J may be further simplified 
using the symmetry over categories, so that enumeration of \Tq\ is not necessary. 
This is true if all categories are non-empty, but if one or more of the categories are 
empty, the symmetry breaks, while M would still be too large for enumeration. 

Table [4] summarizes the computation time for .RaMST and -R u mst in terms of 
K and M. Consider first the listing of all edges in uMST on categories, A4q, 
which is required for -R u mst- This task can be completed in 0(K 2 ) time through an 
algorithm proposed by Eppstein [1995 . Details of the algorithm are in Appendix 



Bl and its theoretical justification is completed by Chen 2012 . 0{K 2 ) time is 



usually affordable since K is no larger than the sample size. Thus, R u kst is 
computationally feasible for any problem. On the other hand, i? a MST requires the 
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Length of haplotype 


K 


M 


2 


4 


4 


3 


8 


384 


4 


16 


42467328 


5 


32 


2.078 x 10 19 


6 


64 


1.66 x 10 45 



Table 3: The number of categories, K, and the MSTs on categories, M, as haplotype 



length increases for the haplotype association problem in Section 5.2 All categories are 
assumed to be non-empty. 



enumeration of all MSTs on categories, not just their edges, and thus adds O(M) 
computation time to the algorithm. For the haplotype example, this makes i?aMST 
computationally infeasible. Thus, in the next Section, we propose a statistic that 
is motivated by i? a MST but that is computationally tractable for all problems. 





Task 


Computation Time 


-RaMST 


Enumerating all MSTs on categories 


0(K 2 + M) 


-RuMST 


Listing edges in uMST on categories 


0{K 2 ) 



Table 4: Computational time for i?aMST and i? U MST- M is the number of MSTs on cate- 
gories. 



4.4 A Fast Method Generalized from -RaMST 

Corollary [T] gives a simple and intuitive form of i? a MST when there is a unique MST 
on categories. In that special case, i?aMST is the sum of mixing potentials com- 
puted within each category and mixing potentials computed between categories 
that are connected by an edge of the MST Tq. Evidence against the null increases 
if this sum of mixing potentials is small, as compared to random permutation. 
In (11), the MST Tq serves as an enumeration of the pairs of categories that are 
highly similar. There is nothing sacred about the choice of MST for this role. 



The intuitive interpretation for (11) remains if we replace Tq by any other graph 
Co that represents proximity between categories. 

Up to this point, we have assumed that a distance matrix on categories is 
used to represent the similarity between categories. We now bypass the distance 
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matrix and assume that similarity is directly represented by a graph Co with 
the categories as nodes. Our goal is to incorporate the proximity information 
encoded by the graph into the two group comparison. We propose the following 



statistic, which we call Rc , obtained by substituting Co for Tq in (11), 

K 

_ V - ^ ^ n ak n bk _|_ n au^bv ' n av^bu (14) 

fc=i (u,v)ec 

The above statistic has a similar interpretation to -RaMST: Consider all Co-spanning 
graphs, which are graphs on subjects where every pair of subjects are connected 
by a path if they are in the same category or they are in two categories that 
are connected by a path in Co- Hence, minimum distance Co-spanning graphs 
connect subjects within categories by spanning trees and connects exactly one 
pair of subjects between each pair of categories that have an edge in Co- Rc is 
the averaged sum Q over all minimum distance Co-spanning graphs. 

If Co is given, computing Rq only requires 0(K + \Cq\) time. If Co is 
not given, the choice of Co can often be guided by domain knowledge. In the 
examples below, our choices for Co include the uMST on categories, which we 
denote by C-uMST (same as Mq), and the uNNG on categories, which we denote 
by C-uNNG. Since C-uMST and C-uNNG can both be computed in 0(K 2 ) time, 
-Rc-umst and -Rc-unng require only 0(K 2 ) computation time for any problem. 



5 Examples 

In this section, the application of .Rc-umst> -Rc-unng and -R u mst are illustrated on 
several examples, both real and simulated. In the simulated examples, their 
power are compared to Chi-square tests. The p-values for all tests are calcu- 
lated through 1,000 permutation samples for each run, and the power calculated 
through 1,000 simulation runs. 



5.1 Preference Ranking 

Consider comparing two groups of subjects on the ranking of four objects. Let 
S be the set of all permutations of the set {1, 2, 3, 4}. Data are simulated under 
the following model: Subjects from group a have no preference among the four 
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objects, and so their ranking is uniformly drawn from S. The rankings of subjects 
from group b are generated from the distribution 

Pe(C) = exp{-0d(C, Co)}, C, Co G 3, 8 G M, (15) 

where •) is a distance function and ^ a normalizing constant. This probability 



model, first considered by Mallows [1957 with Kendall's or Spearman's distance, 



favors rankings that are similar to a modal ranking Co if & > 0. See Diaconis 



1988] for more discussions. The larger the value of 8, the more clustering there 



should be in group b around the mode Co- We experimented with both Kendall's 
and Spearman's distance and various values for 8. We assumed that the true 
distance function used to generate the data is either known and used to construct 
the graph, or unknown, in which case an incorrect distance is used. 

Figure [4] shows C-uMST and C-uNNG formed on a typical data set generated 
under 8 = 5 with n a = rib = 20. Spearman's distance is used in both the 
generating model and for constructing the graph. In this particular example, 
C-uMST contains all edges in C-uNNG with three extra edges, shown in thinner 
lines. The reason this happens is that no category is as close to category "3241" 
as category "3142", and no category is as close to category "3142" as category 
"3241" . For MST on categories, more edges are needed to form a spanning tree. It 
is clear that in this case, there are three MSTs on categories, each one obtained 
by adding one of the three thinner edges to the C-uNNG. In most simulation 
runs, C-uMST and C-uNNG are the same, while in those runs where they differ, 
C-uNNG is always a subset of C-uMST. 

Figure [5] shows the power versus type I error for 8 = 5, n a = n& = 20 
under different combinations of using Kendall's or Spearman's distance for the 
generating model and for constructing the graph; and Table [5] lists the power 
under two most commonly used significance levels - 0.01 and 0.05. We see that 
even when a wrong distance is used, the graph-based tests still have significantly 
higher power than the Chi-square tests. For this simulation setting, -R u mst is the 
most powerful among the three graph-based tests; .Rc-umst and -Rc-unng perform 
similarly with -Rc-umst a little better in all cases, implying that the extra edges 
in C-uMST do give additional useful information. 
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C-uMST C-uNNG 




Figure 4: C-uMST and C-uNNG constructed on a typical data set generated under 
parameters Co = 1234 and 9 = 5 with n a = rib = 20. The Spearman's distance is used in 
both the generating model and for constructing the graph. Each node is labeled by the 
ranking it represents, followed by the number of subjects from groups a and b with that 
ranking in parentheses. 



KK 


uMST 


C-uMST 


C-uNNG 


Pearson 


LR 


a = 


0.01 


0.566 


0.413 


0.397 


0.214 


0.206 


a = 


0.05 


0.784 


0.660 


0.648 


0.450 


0.439 


KS 


a = 


0.01 


0.567 


0.395 


0.385 


0.221 


0.209 


a = 


0.05 


0.784 


0.649 


0.631 


0.455 


0.437 


SS 


a = 


0.01 


0.588 


0.491 


0.478 


0.247 


0.240 


a = 


0.05 


0.807 


0.715 


0.703 


0.485 


0.480 


SK 


a = 


0.01 


0.607 


0.495 


0.486 


0.253 


0.248 


a = 


0.05 


0.811 


0.729 


0.715 


0.494 


0.481 



Table 5: The power of five tests - three graph-based tests based on i? U MSTj -Rc-umst, -Rc-unng 
and two Chi-square tests - under two significance levels (a = 0.01, 0.05) and different 
simulation settings. 
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Figure 5: Power versus type I error for five different tests in the preference ranking 
example with 9 = 5 and n a = m = 20. One of two distance measures (Kendall's or 
Spearman's distance) was used for the generating model and for constructing the graph. 
The title of each plot denotes the choice of distance: The first letter denotes the distance 
used in the generating model ("K" is Kendall's and "S" is Spearman's distance); and the 
second letter denotes the distance used in constructing the graph. For instance, "KS" 
in the bottom left panel means that Kendall's distance is used in the generating model, 
but Spearman's distance is used in constructing the graph. 
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5.2 Haplotype Association 

In this example, we consider a disease model where the probability for disease 
depends on the haplotype at four single nucleotide polymorphisms (SNP). We 
encode the allele at each SNP as or 1 , and so the haplotype can be represented as 
a binary string. We assume that the disease probability depends on the number 
of positions at which the subject's haplotype agrees with a target haplotype: 

P(Disease) = 0.3 + 0.1 x (Number of positions in agreement). 

Thus, the probability of disease can take values 0.3 0.4, 0.5, 0.6 or 0.7 depending 
on whether there are 0, 1, 2, 3 or 4 positions in agreement. To make the problem 
harder, we assume that seven non-informative SNPs are analyzed together with 
the four informative SNPs, and that which and how many of the 11 SNPs are 
informative is unknown in the analysis. Thus the data actually consists of haplo- 
types of length 11. There are 2 11 = 2, 048 possible categories. In each simulation, 
1,000 haplotypes with length 11 are generated uniformly from all possible hap- 
lotypes. Each subject with a given haplotype is signed as "patient" or "normal" 
according to the disease model. Since only 1,000 subjects are simulated in each 
run, not all of the 2,048 categories are represented. The number of non-empty 
categories in each run ranged from 755 to 823, with an average of 791 in the 1000 
simulation runs. The Hamming distance is used to construct the graph. Figure 
[6] shows the power versus type I error plots for the five tests. It is clear that, 
by incorporating the information in the graph, tests based on -R u mst, -Rc-umst and 
-Rc-unng ah have much higher power than the Pearson's Chi-square and deviance 
tests. Among the three graph-based tests, the one based on -R u mst works a little 
better than the ones based on Rc-mst and -Rc-unng- 



5.3 Binary Clinical Features 

This example comes from Anderson et al. [l972| and Nettleton and Banerjee 



20011. Data on the presence or absence of 17 clinical features of the eye ailment 



Keratoconjunctivitis Sicca (KCS) are given for two groups of patients. A ques- 
tion asked by Nettleton and Banerjee was whether the two groups of patients 
share a common distribution with respect to these clinical features. The sizes of 
the groups are 40 and 24. It turned out that only two subjects had the same 
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Figure 6: The power versus type I error plots for the five tests for the haplotype example. 
The length of the haplotype is 11, but only 4 positions informative. 
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outcome for the 17 clinical features, so there are in total 63 distinct categories. 
The Hamming distance is used to construct the graph, and p-values are calcu- 
lated through 10, 000 permutation samples and shown in Table [6} Nettleton and 
Banerjee's method is based on the uNNG on subjects. As discussed before and 
confirmed by simulation studies in Section [4.2[ the uNNG on subjects has lower 
power than MST based tests when many categories have more than one subject. 
This is not a problem in this data because only one category has more than 
one subject. We see that -Rumst, -Rc-umst and -Rc-unng all detected the difference 
between the two groups of patients, while the Chi-square tests did not. 



-RuMST 


-Rc-uMST 


-Rc-uNNG 


Nettleton and Banerjee's 


Pearson 


LR 


0.0011 


0.0010 


0.0006 


0.0007 


0.5200 


0.5200 



Table 6: P-values for the KCS data set. 



6 Permutation Distributions of the Test Statistics 



Based on the results in Sections 4.2 4.4, we focus on -Rc-umst and -R u mst- In this 



section, we consider the permutation distributions of these two statistics in their 
generalized forms. That is, we consider Rq and Tq,, the latter defined as 

K 

Tc =S ^n au n bu + ^2 (n au n bv + n bu n av ) (16) 
«=i (u,v)eCo 

7c-umst is equivalent to -R u mst- 

We define two quantities that will be used to characterize the permutation 
distributions: 

A := max|£ u °|, the maximum node degree in Co- (17) 

u 

(3 := maxm u , the maximum total count for a category. (18) 

u 

By permutation distribution, we are referring to the distribution of the statistic 
under random uniform permutation of the group labels. This is used as the null 
distribution to assess statistical significance. We use P P , E P and Var P to denote 
the probability, expectation and variance under the permutation null. 
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6.1 R Co 

The following lemma states that the first two moments of Rc under the permu- 
tation null can be computed instantaneously using basic summary statistics of 
the graph and cell counts of the contingency table. 

Lemma 1. The mean and variance of Rc under the permutation null are 

E P [R Co ] = (N-K + \C \)2pi, (19) 

/ K \ S |2 K \s \\ 

Var P [R Co ] = 4(px - p 2 ) [N - K + 2|C | + £ _ £ ijil j (20 ) 

V u=l u u=l u / 



where 



+ (Qp2-4p 1 )(k-Y j —\+ P2 V 

\ ^-^ m u / ^-^ m u m v 

\ u=l U J (u,v)eC 

+ {N _ K + \ Co \)\ P2 -Ap 2 1 ), 



n a n b _ 4n a (n a - l)n b (n b - 1) 

Pl ~iV(iV-l)' P2 ~ N(N -l)(N -2)(N -3)' [ ' 



Proof of Lemma [T] is given in Appendix |C.1| 
Remark 2. As N — > oo, n a /N — > 7 6 (0, 1), we have p 2 = kp\ and thus 

Var P [R Co ] = 4(p! - p 2 ) (n - K + 2|C | + £ ^ - £ ^) 

\ u=X u u=l u / 

+ (6p 2 -4 Pl ) (k- V— \ +p 2 V . 

\ ^-^ m,, / m,,m v 
\ u=i U J (u,v)eC 

Furthermore, if 7 = 0.5, £/ten pi = = 1/4 and we have 



Var, 



if K 1 \ 1 1 

\ w =i V (WleCn " " 



(«,t))ec 

We next give sufficient conditions guaranteeing the convergence to normality 
of Rc Q after standardization by its mean and variance. 

Condition 1. 

K 

Y,m u (m u + \8^\){m u + £ m, + |^|) ~ (if 3 / 2 ), 

W=l DGVu 
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{m u +m v +\£^°\+\£^ \){m u +m v + £ m„, + |^ 2 | + |^|)~ (^ 3 / 2 ). 
(«,v)eC tee(V„uv,,) 

Condition [I] constrains the size of "hubs" in the graph: The node degrees in Cq 
and the number of observations in each category must not get too large. It can 
be simplified to stronger conditions that are easier to comprehend. For example, 
the following implies Condition 1: 

Condition I". /3 6 A 2 and A 8 are both o(K). 

The second condition is usually trivial: 

Condition 2. N, \C \, and E(u,v)eC are aU °( K )- 

The asymptotic distribution of the standardized form of Rc is given in the 
following theorem. 

Theorem 3. Assume that conditions^ and^ hold. Under the permutation null, 
the standardized statistic 

Rep - Epfficp] 
y/Var P [Rc ] 

converges in distribution to N(0, 1) as K — )■ oo and n a /N is bounded away from 
and 1. 

The proof of Theorem [3] is given in Appendix |C.2| 

Theorem [3] can be applied to any type of graph, allowing for repeated ob- 
servations of each node. Since the statistics in Friedman and Rafsky 1979 and 



Rosenbaum (2005 do not allow ties, their asymptotic normality results are also 



restricted to the case where each node is observed only once. To compare The- 
orem [3] to its counterpart in these two papers, we let G = Co and assume that 
m u = 1. Thus, 



N = K, V _J_ = |<7 | = 
^ m u m v 



Condition 2 requires that \G\ ~ OiK) and Condition [I] can be simplified to: 



u=l 



(1^1 + I^IXIfigal + ~ o(K^ 2 ). 

(u,v)eG 
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Hence, Theorem [3] implies the asymptotic normality result in Rosenbaum 



2005 



since \£%\ = 1, |£^ 2 I = 1,\G\ = K/2 for MDP. Friedman and Rafsky 
proved a more general condition for asymptotic normality of sums @ aft er stan- 
dardization: For sparse graphs where \G\ ~ O(K), the number of edge pairs 
that share a common node must be O(K). Condition [l] is neither stronger or 
weaker than Friedman and Rafsky's condition. For example, consider a graph 
with one node having degree K 1 / 2 and all other nodes having degree 1; this 
graph satisfies Friedman and Rafsky's condition but not Condition [TJ since 
Y1(uv)eG \^u\\^u,2\ = 0(K 3 / 2 ). On the other hand, a graph with y[K nodes 
having degree K 0,5 and all other nodes having degree 1 would satisfy Condition 
[I] but not Friedman and Rafsky's condition. 

6.2 T Co 

The following lemma is the counterpart of Lemma [T] for Rc - It's proof is given 



in Appendix C.3 



Lemma 2. The mean and variance ofTc under the permutation null are 

( K \ 

Ep[Tc ] = ^m u (m u - 1) + 2 m u m v p 1; (22) 

V=i (u,v)€C J 



K 

Var P [T Co ] = (pi - V2) ^ m «( m « + ^ m v - l)(m u + y~) m v - 2) (23) 
u=i veV u vev u 

( K 

+ (pi - P2/2) ^ rn u (m u - 1) + 2 ^ m u m v 
\u=i (u,v)ec 

(K 

+ (?2 ~ 4pf ) m u (m u - 1) + 2 m u m v 

\u=i (u,v)eC 



where pi and P2 are given in (21). 



The next theorem gives a sufficient condition for asymptotic normality of 
Tc under the permutation null. 
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Theorem 4. If ^2^ = iTn u (m u + J2vev u m v) 2 ~ 0{N), then under the permuta- 
tion null distribution, the standardized statistic 

Tc - E P [T Co ] 
^/Var P [T Co ] ' 



where ~E P [Tc ] and Var P [Tc ] are given in (22) and (23), converges in distribution 



to N(0, 1) as N —> oo, and n a /N bounded away from and 1. 

Proof. Let G be the uMST on subjects. Then as long as J2i=i 1^1(1^1 ~~ 1) 



O(N), asymptotic normality can be ensured following |Friedman and Rafsky 



1979 



's result. Notice that if % is in category u, then \£@\ = (m u — 1) + 

6.3 Checking the P-values Under Normal Approximations 

We now check the normal approximations to the p-values of the three graph- 
based statistics - -Rc-umst, -Rc-unng and -R u mst _ through simulation. We adopt 
the setting of the haplotype example. In each simulation run, iV haplotypes 
with length I are generated uniformly from all possible haplotypes with length I. 
They are assigned to either group with equal probability. Hence, the two groups 
have the same distribution. For each simulation run, we calculate the difference 
between theoretical p-values from the normal approximation and the permutation 
p-values from 10,000 permutations for the three statistics. We consider different 
sparsity settings by varying I, which controls the number of categories, and N. 
Under each setting, 100 simulation runs are done, with boxplots of the differences 
between theoretical and simulation p-value shown in Figure [7j We increased / 
from 6 to 10, and thus the number of possible categories considered grows from 
64 to 1024. The sample size N varies from 100 to 1000. This spectrum of values 
is reasonable for a genetic association study. 

Simulation results under this setting shows that the normal approximation is 
better for -Rc-umst and -Rc-unng than for R u mst- Accuracy of normal approximation 
improves for all statistics as I and N increase. For -Rc-umst and -Rc-unngj when the 
number of possible categories is larger than 256 and the number of observations 
is larger than 200, the p-value from normal approximation is quite accurate. 
While for i? U MST, the number of observations needs to be larger than 500 to 
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achieve similar accuracy. For Rtmst, when the number of possible categories is 
larger than the number of observations, the p-value calculated from the normal 
approximation is negatively biased, and thus less conservative. The bias is less 
severe for -Rc-umst and -Rc-unng, while still problematic when the number of possible 
categories is 1024 and the number of observation is only 100. Skewness correction 
can be done to make the theoretical p-values more accurate, but when N is so 
small, it would be easier to just do permutation directly. 



7 Conclusions and Discussion 

We have described a new approach for comparing two categorical samples, which 
is appealing when the contingency table is sparsely populated. Sparse contin- 
gency tables are common in many modern applications where the number of 
categories, K, is large compared to sample size. In such situations, the different 
categories can usually be related to each other in a systematic way. The new 
approach utilizes a graphical encoding of the similarity between categories to 
improve the power of two-sample comparison. We showed, through simulations 
and real examples, that utilizing graphical information improves the power over 
deviance and Pearson's Chi-square tests. The proposed statistics are shown to be 
asymptotically normal after standardization, under assumptions that limit the 
hub size and density of the graph. This allows instantaneous type I error control 
for large data sets. 

The power of the new approach depends on the choice of an informative 
similarity measure between categories. This part of the analysis should rely on 
domain knowledge that is specific to each application. For ranking data from 



surveys, one can start with the standard distance measures used in Example 5.1 
When the number of categories is large, drawing relationships between categories 
is a necessary and often default step in analyzing the data. 

Both -Rc-umst and -R u mst work well when the similarity information is effective 
with -Rumst usually having better power. However, when the similarity measure 
is not as informative, -R u mst can have very low power, even when compared to 
Chi-square tests. In our simulation studies derived from the Haplotype problem, 
the normal approximation is more accurate for -Rc-umst than for i? U MST- For -R u mst, 
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C-uMST C-uNNG uMST C-uMST C-uNNG uMST C-uMST C-uNNG uMST C-uMST C-uNNG uMST 




N=100 N=200 N=5O0 N=1000 

Number ol possible categories: 64 



C-uMST C-uNNG uMST C-uMST C-uNNG uMST C-uMST C-uNNG uMST C-uMST C-uNNG uMST 
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Figure 7: Boxplots for the differences between £>- values calculated from normal approxi- 
mation and 10,000 permutations. 
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p-values obtained from normal approximation are lower than actual p-values in 
extremely sparse situations. All p- value approximations work well when the 
sample size is comparable to the number of categories. 

Generalization of this approach to multi-sample comparison is straightfor- 
ward by letting take K' distinct values, where K 1 is the number of groups. 

A The Test Statistic Based Oil -RaMDP 

We first assume N, the total number of observations, to be even. Let Kq be the 
number of categories containing an odd number of subjects. Since iV is even, Kq 
is even. [Kq can be 0.). Without loss of generality, let categories 1, . . . , Kq be the 
categories containing an odd number of subjects, and categories Kq + 1, . . . ,K 
be the categories containing an even number of subjects. More notations are 
defined below. 

• A = {x = (xx, . . . , xk q ) T '■ %i 6 {a, b}, i = 1, . . . , Kq}: all possible combina- 
tions of group identities of the subjects with one from each of the categories 
containing an odd number of subjects. 

• i?o(ra a , rib): the number of edges connecting subjects from different groups 
averaged over all perfect pairings of n a points from group a and n& points 
from group b in the same category, with n a + n& being even. 

• -R x , x£i: the number of edges connecting subjects from different groups 
averaged over all MDPs on categories 1, . . . , Kq. 

Assumption 1. If a category has an even number of subjects, the subjects are 
paired within the category. 

Assumption [l] is usually true for MDP on subjects for categorical data. It 
is explicitly stated here to avoid the complicated scenario when the triangle 
inequality becomes equality in the distance metric for any three categories. 

Proposition 1. Under Assumption [71 the test statistic based on averaging ^ 
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over all MDPs is: 



K 1 I K ° 

RaMDP = E Ro{n a k, n bk ) + — ^ ^ I \\ n Xii 

k=K {1 +l 1 lfe=i mk xeA I i=l 



K 

Rx + ^2 Ro(n Xjj -l,n x c 

3=1 

(24) 



where x\ 



b if Xi = a 
a if Xi = b 



Ro(n a ,n b ) = Em ° 
its \ 1 



n b 
i 



i\ (n a -i- 1)!! (nt-i- l)!!/(n a + n 6 - 1)!! 

(25) 



with 



and 



S = 



{0, 2, . . . , n a A n;,} i/ n a and n;, 6oi/t even 
{1, 3, . . . , n a A rife} i/ n a and n b both odd 



R x =\n*\- 1 E E w*,-> (26) 

w*£SJ* (i,j)6a)* 

where co* is an MDP on categories 1, . . . , Kq, and fi* is the set of all these uj* 's. 

Proof. First consider the simpler case: One category with n a subjects from group 
a and n b subjects from group 6, with n a + n b even. Since all subjects are in the 
same category, any perfect pairing is an MDP. There are in total (n a + n b — 1)!! 
different perfect pairings. 

When both n a and n b are even, the possible numbers of edges connecting 
different groups are: 0, 2, ... , n a An b . Among all the (n a +n b —l)\\ perfect pairings, 
the number of perfect pairings having i G {0, 2, . . . , n a A n b } edges connecting 
different groups is 



n a \ I n b 



i\ {n a -i-l)\\ {n b -i-l)\\. 



When both n a and n b are odd, the possible numbers of edges connecting different 
groups are: 1, 3, . . . , n a A n b . Among all the (n a + n b — 1)!! perfect pairings, the 
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number of perfect pairings having i 6 {1, 3, . . . , n a Arib} edges connecting different 
groups is also 




n b 
i 



i\ {n a -i-l)\\ (n h -i-l)\\. 



(25) follows immediately. 

Under Assumption[TJ an MDP on all subjects would be an MDP on categories 
Kq, (uj*), embedded on the subjects similar to the MST case, with all other 



subjects paired within each category, so (24) follows naturally. 

□ 

Remark 3. If N, the total number of observations, is odd, we can add a pseudo 
category with one subject, whose distance to any other category is 0. All deriva- 
tions are the same, except that the edge containing the pseudo category is dis- 
carded from the MDP on categories in later steps. 

B Computation Time for i^aMST and i? uM sT 

The main task for computing i? a MST and -R u mst are to enumerate all MSTs on 
categories for Rmst and to list the edges in Ai^ for -R u mst- Other tasks can be 
finished in 0{K) time. 

Let G be the complete graph on the K categories. |G| = K(K — l)/2. 



Eppstein [1995 proposed a graph operation called the sliding transformation 
which, when applied to G, produces an equivalent graph such that the MSTs 
on categories correspond one-for-one with the spanning trees of the equivalent 
graph. The enumeration of all spanning trees, without having to optimize for 
total distance, is relatively straightforward. Thus, we adopted the following 
computational approach: Use Eppstein's method to construct the equivalent 
graph of G, enumerate all spanning trees of the equivalent graph, then transform 
back to get the set of MSTs on G. The sliding transformation constructs the 
equivalent graph in C(|G| + KlogK) = 0(K 2 ) time. To perform the sliding 
transformation, an initial MST is needed. Prim's algorithm can be used to obtain 
the initial MST, which needs 0(K 2 ) time, not increasing the time complexity. 



The theoretical justification of this algorithm can be found in Eppstein 1995 



and Chen 2012] , which completes many of the proofs of Eppstein [T995 . 
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After removing any loops formed during the the sliding transformations, each 
remaining edge appears in at least one spanning tree of the equivalent graph, thus 
appearing in at least one MST on G. Now we have the list of edges in uMST on 
G, and thus -R u mst can be calculated in 0(K 2 ) time. 

For enumerating all spanning trees of the equivalent graph, the algorithm 



proposed by |Shioura and Tamura [1995 is used, which requires 0(K+\G\+M) = 
0(K 2 -\-M) computation time. This was proven to be optimal in time complexity. 
Shioura and Akihisa's algorithm starts from a spanning tree formed by depth- 
first search, then replaces one edge at a time using cycle structures in the graph, 
traversing the space of all spanning trees of the graph. Hence, computing i? a MST 
takes 0(K 2 + M) time. 

C Proofs for Lemmas and Theorems in Permutation 
Distributions 

C.l Proof of Lemma Q] 

Proof. Define 



K 1 

Ra = Y,~ E W;/, 



m, u 

u=l i,j&C u 



and 



r b= E — E 



m u m v 1 — ' 
(u,v)ec iec u ,jec v 



We have 

E P [i? Co ] = E P [R A ] + E P [R B ) 

K 

= E — E Pp (^ ^ 9j) + E — E Pp (^ + 



m u 1 — ' « — ' m u m v 
u=l i,j£C u (u,v)eC ieCu,j£C v 



Since P P (^ / gj) 
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2n a n b 
N(N-l) 



thus 



K 



2n a n b 



(JV-tf + |C |)- 



2n a n b 



N(N-l) 

Now, to compute the second moment, first note that 

E P [i^J = Bp[R 2 A ] + E P [i?|] + 2B P [R A R B ]. 
Expanding the right-hand-side in above, 

k 

E p p(9i^ 9j,9k^ 9l), 



E P [R A ] 



m,,m r 



E P [R 2 B ] 



E- 

E ri^2 E Pp(9i^9j,9k^gi) 

,v)ec u v i,kec u , j,iec v 

+ 2 E — 1 — E 

C 
1 



{(«,«),(^,2/)}cCo V i£C u , j&C v , k€C w , leCy 



Pp(5i 9j,9k / 



K 



E p [RaRb] = Y1 E 
Since 



m„m t m», 
u=i (v,w)ec i,jec u , fcgc„, «ec 



E p p(ff» ^ 9j,9k^ 9l)- 



Pp(5i / Sj.Sfc ^9l) = < 





N(N-l) 



N(N-l) — P 1 



if i 


= j 


and/or fc = 


I 


•• 




= k,j = l,i 


+ 3 






= l,j = k,i 






i 


= k,j / i,Z 




if < 


i 


= l,j^i,k 






3 


= k,i^j,l 








= l,i^j,k 





V N n (N-i)(N-2™(N% = Pi if *> ■?> fc > 1 are a11 different, 



we have 



K 1 fc 1 



777- 

u=i u i,j,k,iec u 



m u m v 

u=lv^u i,j&Cu, k,leC v 
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= ^2 -^[2m u (m u - l)(2pi) + Am u (m u - l)(m u - 2)p 1 + m u (m u - l)(m u - 2)(m u - 
u=i mu 

k 1 

+ y~] m u (m u - l)m v (m v - l)p 2 

^-f ^ m u m v 

= 4 ( N - 2K + V —\ pi + (N — K — 4)(7V — K)p 2 + 6 [ K - V — ^ p 2 , 

Ep [^]= E ^2 E V?(9i^9j,9k^9l) 

(u,v)ec u v i,keCu, j,iec v 

+ E E p p(3i/ 9j,9k¥= 9l) 

+ Y] Y] Pp(5i / 9j,9k¥= 9l) 

z — ' m u m v m w m y ^-^ 

(u,v), (w,y) 6 Co i 6 C u , j e C„ 

u,v,w,y all different k £C W , I & C y 

E ! J m u m v (2pi) + m u m v (m u + m v - 2)pi + m u (m u - l)m v (m v - l)p 2 ] 



(m,^)GCo u v 



+ ^ -^r^ [m u m v m w pi +m u (m u - l)m v m w p 2 ] 

(u,v),(u,w)£Co, v=£w 

V- 1 

+ > m u m v m w m y p 2 

m u m v m w m y 

(u,v), (w,y) 6 C 
u,v,w,y all different 

= V" — - — [K + m„)pi + (m u - l)(m„ - l)p 2 ] 
(w,w)eCo 

+ E ^\Pi + {rn u - l)p 2 ] 

(u,v),(u,w)eCo, Vy^w u 

+ 2|{(u, v), (w, y)} C Co : u, u, y all differentia 
K |^C |2 i 
= E^^^-^) + |Co| 2 P2+ £ P2, 



K 



E P [R A R B ]=<T E E VA9i*93,9k*9l) 
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K 1 

+ Y] Y] Y Y Pp(5i + 9j,9k + 9i) 

m u m v m w ^-^ 
m=i (v,w)EC \e£° i,jec u kec v ,iec w 

K 1 

= y~] y~] [2m u (m u - l)rn v p\ + m u (m u - l)(m u - 2)m v p 2 \ 

^ ^ m A u m v 

K 1 

+ Y] y~] rn u (m u - l)m v m w p 2 

^ ' m u m v m w 

(v,w)<=C \sZ 

= \C \(N -K) P2 + 2( Pl -p 2 ) (2\C \-^) . 

Varp[i?(7 ] follows by combining the above in computing Ep[i?^ ], and then sub- 
tracting Ep[i?c ]. □ 

C.2 Proof of Theorem M 

To prove Theorem [3j we first prove a simpler result: Asymptotic normality of 
the statistic under the bootstrap null, defined as the distribution obtained by 
sampling the group labels from the observed vector of group labels with replace- 
ment. Let P B , E B and Var B denote respectively the probability, expectation and 
variance under the bootstrap null. 

Lemma 3. Assuming condition^ under the bootstrap null distribution, the stan- 
dardized statistic 

Rc - B B [Rc ] 
^Var B [R Co ] 

converges in distribution to iV(0, 1) as K —> oo, where Eb[Rc ] an d Var B [Rc ] 
are given below. 

E B [R Co ] = (N-K+\C \)2p 3 , (27) 
Var B [R Co ] = 4( P3 - p 4 ) f N - K + 2\C \ + £ ^ - £ ^ ) 

\ u=l u u=l u / 

+ (Gp4-4 P3 )(k-Y^—\+p 4 V 1 , 

\ ^-^ m u J m u m v 

\ u=i u / {u,v)ec 

where 

n„nh 4n„nf 9 , . 

P* = i^> w = -^r = ^- (29) 



(28) 



Graph-Based Tests for Categorical Data 38 

The proof of Lemma [3] relies on Stein's method. Consider sums of the form 
W = wnere J is an index set and £ are random variables with E[£i] = 0, 

and .E^ 2 ] = 1. The following assumption restricts the dependence between 

{& = * e J}- 



Assumption 2. Chen and Shao, 2005, p. 17] For each i £ J there exists 



Si C Ti C J such that £j is independent of £s? and is independent of £t= ■ 

We will use the following existing theorem. 
Theorem 5. \Chen and Shao , 2005\ Theorem 3.4] Under Assumption^ we have 



sup \Bh{W)-Bh(Z)\<5 

h£Lip(l) 

where Lip(l) = {h : M. — > M}, Z has M(0, 1) distribution and 



iVi\ 



with r)i = Y2jes- anc ^ ^* = Ylj^T Cj'j where Si and Ti are defined in Assumption 



Proof of Lemma\^ The mean and variance of Rc under the bootstrap null, (27) 



and (28 ), can be obtained following similar steps as the proof of Lemma [TJ noting 
tstrap null, 



that, under the bootstrap null, 

if i = j 

TV 2 



^-2p 3 if 



and 



N 2 



w 



N 4 



2p 3 if 



P3 



PA 



if i = j and/or k = I 
i = k,j = l,i / 3 
i = 1,3 = k,i / j 
i = k,j 

i = 1,3 ^hk 
3 = k,i^ j,l 
3 =l,i¥ z 3,k 
if i,j,k,l are all different . 



if < 
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To prove asymptotic normality, we first define more notations. For any node u 
of Co, let 

Ru = naunbu , d u = B B [R U ] = 2(m u - l)p 3 , 
m u 



where p3 is defined in (29). Similarly, for any edge (u,v) of Co, let 

tiuv — , O-uv — &B[-n-uv\ — *P3- 

m u m v 

Let cTg = Var B [i?(7 ], £ u , £ uv be the standardized mixing potentials, 

e. = (30) 

j. R"UV d uv fni\ 

C,uv = • (olj 

Finally, we define the index sets for £ u and ^ uv : 

Jl = {l,...,K}, 
Ji = {uv : u < v such that (u, v) G Co}, 

/(u,v)eC 



and let J = J\ U J" 2 - Since Rc = J2u=i ^« + J2(uv)ec n tne standardized 



statistic is 

Our notation follows those of Theorem [5] and Assumption [2] For u G J\, let 

S u = {u} U , to : (u, v) G Co}, 

T M = 5 U U to, : (u, u), (v, w) G Co}. 

For uv G J2-, let 

5*™ = {uv, u, v} U {uw, wu : (u, it?) G Co} U {to, row : (t>, «;) G Co}, 

= U {w,wy,yw : (u,w), (w,y) G C } U {w,wy,yw : (v,w), (w,y) G C }. 

S U ,T U , S UV ,T UV defined in this way satisfy Assumption [2j 
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nu,— ll 



Since R u G [0, p 3 G [0, J], and G [0,1], we have d u G [0 , 2 j, 
<4i> G [0, 1], and therefore |£ u | < < ^. Hence, 



V 1^1 < — (m u + \£^\), ueJi, 

(Tr 

— CTr ^ — 

E 101 < — ("it. + m« + + «« e J 2 , 

E ^ — ( m « + m - + E m - + + l^3l)> e ^2. 



0"B 



As in Theorem [5| let r/j = X^'es, an< ^ ^* = SjeT 4 Then 



Eaie^ti = Eb|6 E E &i ^ Eb I&i E l&l E 

j&Si k&Ti jeSi k&Ti 

|E B (e^oiE B |^i < e b |c y, oi e bI E &i ^ Eb I^i E i^-i Eb E i&i 
Eb|^| = Eb|6 E E oai < e b |6I E &i E i&i- 

ieSi fceSi jeSi fees* 

Thus, for i = u G Ji, the terms E B |^r/i0i|, | E B I e b I I , an d E B |^r/j 2 | are all 
bounded by 

-^m u (m u + 

aB ^ev„ 
and for i = uv G J2, the terms E B |Ci?7i^|, |E B (^i»7i) | Eb| 5 and E B |£i7? 2 | are all 
bounded by 

\{m u + m v + \£°°\ + |£„|)(m„ + m v + V m w + + |£§|). 
Hence, 



0"k 



5 '* 



Co i 



5 < -3 E m «( m « + + E mv + 

°" B \u=l !J6V„ 

+ E ( mu + m « + 
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Since <tb is of order \[K or higher, under condition [lj 5 — > as K — > oo. 

□ 

Proof of Theorem^ To show the asymptotic normality of the standardized statis- 
tic under the permutation null, we only need to show that {Rc ,n^) converges 
to a bivariate Gaussian distribution under the bootstrap null, where is the 
number of observations that belong to group a in the bootstrap sample. Then 
asymptotic normality of Rc under the permutation null follows from the fact 
that its distribution is equal to the conditional distribution of Rc given = n a . 
The standardized bivariate vector is 

{ VVarB^Co] ' CT o ) 

with p a = n a /N, Uq = Np a (l — p a ). By the Cramer- Wold device, we only need 
to show that 

R Co -B B [R Co ] nj 3 - N Pa 

a\ — = — V 0,2 

^Var B [R Co ] 

is asymptotic Gaussian under the bootstrap null for all a\,ai G M,aia2 7^ 0. 

Let G J be defined in the same way as in the proof of Lemma [3} Let 
Js = {\J\ + 1, • • • , \J\ + K}. For i G Ja, let 

ii = , i =i-\J\. 

GO 

We use Theorem[5]to show the asymptotic Gaussianity of Yliej a i^+Siej3 a2 ^«- 
We need to redefine the neighborhood sets to satisfy Assumption [2] 
For u G Ji, 

S u = {u,u+ \J\} U {uv, vu : (u, v) G C }, 

T u = S u U {v,t> + iJ^m^uw : (it,u), G C }. 



For uv £ J2, 



S U v = {uv, u, v,u + I J\, v + \J\} U {uw, wu : (u, w) G Co} 

U {to, wv : (v, w) G Co}, 
T uv = S uv U {w,w + toy, yio : (w,y) G C } 

U + 1^1, wy,yu; : (v,w), (w,y) G C }. 
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And for u G ^3, 

S u = {u, u'} U {uv, vu : (u, v) G Co}, u = u — \J\, 
T u = S U L> {v,v + \ J\,vw,wv : (tt',u), G C }. 

From the proof of Lemma [3j we have 

Z0" B <t b 



For u G J3, 



|&|<— , u' = u-|J|. 



Let a = min(o"B, <7o), then 



X 101 < -(2m u + 2m,, + 2 m ro + + |£g>|), uv G J 2 . 



(T 



Thus, for i = it G JiUj 3 , the terms E B |€i»7i^iU |E B (&77i)|E B |0j|, and E B |C^?I 
are all bounded by 

\m u (2m u + \£%°\)(2m u + 2^m v + \£^ 2 \), 



and for i = uv G J2, terms E B |Ci^i^i|> l E B(Ci r ?i)l E B|^|) an d E B |^ry?| are all 
bounded by 

\{2m u + 2m v + \£^\ + \£ v \){2m u + 2m v + 2 V m w + |£f° 2 | + 

«igv„uv„ 

Define W 0lj02 = ^jg^aiCi + Siej" 3 °2£i- The value of 5 in Theorem [5] has 



the form 



1 



Eb[W2 



01,02J 



2^(E B |ai^i| + |E B (oi^)|E B |^|)+^E B |ai^77?| 

v iej iej 



-2 ^(E B |a 2 &77A| + |E B (a 2 e^i)|E B |^|) + ^ E B |a 2 ^f| , . 
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where r/* = EjeS* Zji^jeJ + a 2ljeJ 3 ), and 0, = £j- 6Ti ^ji^jeJ + ^IjeJs)- 
Let a = max(|ai|, | a-2 1 ) , we have 

EB|aiCi?7i^i| 3 E B |a 2 Ci^^l < a 3 E B |& ^ £j E 

< a 3 E B |^| E 1^1 E 1^1' 
|E B (ai^)|E B |^|, |E B (a 2 ^)|E B |^| < a 3 E B |& ^ 0I E b| ^ 01 

^^Eb^I^I^IEb^I^I, 

E B |ai^ 2 |, B B \a 2 Zivf\ < a 3 E B |^ £ ^ &6| 

< a 3 EB|e<| E 1^1 E l&l" 

Thus, 

4n„3 / # 

^ < — , E m "( m « + + E m - + i f Si) 
+ E + m ^ + 

(u,v)ec wgv u uv v j 

Since <r B is at least of order K and a 2 , is of order N, a 2 is at least of order 
K by Condition 2. If E B [W^ ] is uniformly strictly bounded from for any 
a±a 2 7^ 0, then under Condition [TJ 8 — > as K — > oo. 

We next show that under Condition 2, E B [W 2 a ] is uniformly strictly bounded 
from for any a\a 2 ^ 0. 

Let Wi = J2iej 6» W 2 = EieJa 6> * h en 

EbI^,^] = alB B W? + a 2 2 E B Wi + 2a 1 a 2 E B [W 1 W 2 ] 
= a\ + a| + 2aia 2 E B [VFiW 2 ] 

Thus, we only need to show that the absolute correlation between W\ and W 2 is 
uniformly strictly bounded from 1. Notice that, in the theorem, we require n a /N 
to be bounded from and 1, so p a and pb are both bounded from and 1. 
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Correlation between Rc and n^: Observe that 

K 



44 



Rc na 



u =i m "- i,jeC u 

K 



(u,v)eC 



m u m v 



^9%+9j 



V 



N 



E Igx=a 



x=l 



~^m u ^ r ( 7 ^E J ^-) + E mum ^ E U^E4 



For any i / j, 





AT 






E ^= a 


— Eb 









Igi¥=9j,9i=a + ^9i¥=9j,9j=a + E ^9i¥=9j,9x=a 

PB(ffi = a, 3j = 6) + PB(ffi = b, ffj = a ) + E Pb ^ ^ g ^ 9x = a ) 
PaPb + PaPb + 2p a p b p a (N - 2) = 2p a p b (Np a + 1 - 2p a ). 



Hence 



E B [Rconf] = (N-K+ \C \)2 PaPb (Np a + 1 - 2p a ). 
Since E B [i?Co] = (N - K + |C |)2p aPfe and E B [nf ] = Np a , we have 

Cov B (ifco,™f ) = (AT - K + |£ Co |)2p a p 6 (l - 2 Po ). 



(32) 



If p a = 1/2, then Cov B (-Rc > = 0- Since Var B [i?c ] an d VarBfn^] = Np a p b 
are positive, Cor B (i?c ) n a ) = 0> clearly bounded from 1. We consider p a / 1/2 
in the following. 



/ JL \fCo\2 K \fC \ 

Var B [R Co ] = A PaPb (l - Ap aPb ) \N - K + 2\C \ + E " E '" 



tt=l M=l 



+ ip a Pb(GPaPb -1)(k-J2 — I + 4 PaP6 E ~ ~ 

\ m n / ^— ' m u m v 

\ w— l / (u,u)eCo 



A" 



4p a p ft (l - 4 M ) iV - 2K + 2|C | + E 



(l^ Co l/2-l) 2 



+«(V-E— E — 
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Since 



„f (1^1/2 - 1) 2 = f f (Iffll/a - ') 2 > f l^g- ' ? 

^-f m u ^-f m u V i V m ^ 

U=l W=l U=l \ M=l ' 

2 v 2 



' if \ / K \* 

£||ff°|/2-l| > £(|ff°|/2-l) = (\C \-K) 2 , 

KU=1 / \u=l ) 



we have 



Var B [i?c ]Var B [nf ] > 4p 2 a p 2 b (l - 4p aPb )[N -K+ | C 1 ] 2 + 4p 3 aP 3 b N ^ — 
Hence, 



m u m v 

(u,v)ec 



\Cor B (Rc ,n*)\< ' 



{l-ipa Pb )[N-K+\C \Y 

When TV, |C |, E( U)U ) 6 C JS^; ~ °( if )' I Cor B (^c , ™a ) I is bounded by a value 
smaller than 1. 

□ 

C.3 Proof of Lemma [2] 

Let G be the uMST on subjects, and ff = : G G}. Then |ff | = 

w u + Ev„ m « -1 ' l G l = Ef=i m «( 771 «- 1 )/ 2 + E(«^)eC m « m ^- Since E P [T Co ] = 
|G|2pi, and the result follows. 

Now, we compute the second moment. 

(i,j),{k,i)eG 

= ^2 F p(9i7^9j)+ Yl f p(9i 9j,9i 9k) 

+ E F p(9i 9j,9k 9l) 

(i,j),(k,l)€G 
i,j,k,l all different 

N N 

= \G\2 P1 + £ |ff | (|ff | - + (|G| 2 - |G| - ^ |ff | (|ff | - 
i=i i=i 
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4G 



K 



(Pi - P2) ^ m u{ m u + 



m v - 



m t — 2) 



u=i vev u 



V&Vu 



+ (pi-P2/2) J]m u (m u -l) + 2 




«=i O,i>)eC 




2 



Var P [T Co ] follows by EppgJ - E 2 P [T Co ]. 



□ 
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